
    Leeway: Addressing Variability in Dead-Block Prediction for Last-Level Caches


    Addressing variability in reuse prediction for last-level caches

    The Last-Level Cache (LLC) accounts for the bulk of a modern processor's transistor budget and is essential for application performance, as it enables fast access to data that would otherwise require a trip to much slower main memory. Problematically, technology constraints make it infeasible to scale LLC capacity to match the ever-increasing working set sizes of applications. Future processors will therefore rely on effective cache management mechanisms and policies to extract more performance from scarce LLC capacity. Applications with large working sets often exhibit streaming and/or thrashing access patterns at the LLC. As a result, a large fraction of the LLC capacity is occupied by dead blocks that will not be referenced again, leading to inefficient utilization. To improve cache efficiency, state-of-the-art cache management techniques employ prediction mechanisms that learn from past access patterns, aiming to identify as many dead blocks as possible. Once identified, dead blocks are evicted from the LLC to make space for potentially high-reuse cache blocks.

    In this thesis, we identify variability in the reuse behavior of cache blocks as the key factor limiting cache efficiency for state-of-the-art predictive techniques. Variability in reuse prediction is inevitable, as it stems from numerous factors outside the control of the LLC, including control-flow variation, speculative execution, and contention from cores sharing the cache. Such variability prevents existing techniques from reliably identifying the end of a block's useful lifetime, lowering prediction accuracy, coverage, or both. To address this challenge, this thesis aims to design cache management mechanisms and policies for the LLC that are robust to variability in reuse prediction and minimize cache misses, while keeping the cost and complexity of the hardware implementation low. To that end, we propose two cache management techniques, one domain-agnostic and one domain-specialized, that improve cache efficiency by addressing variability in reuse prediction.

    In the first part of the thesis, we consider domain-agnostic cache management, the conventional approach in which the LLC is managed entirely in hardware, transparently to software. In this context, we propose Leeway, a novel domain-agnostic cache management technique. Leeway introduces a new metric, Live Distance, that captures the largest interval of temporal reuse for a cache block, providing a conservative estimate of the block's useful lifetime. Leeway implements a robust prediction mechanism that identifies dead blocks based on their past Live Distance values, monitors changes in Live Distance at runtime, and dynamically adapts its reuse-aware policies to maximize cache efficiency in the face of variability.

    In the second part of the thesis, we identify applications for which existing domain-agnostic cache management techniques struggle to exploit the available reuse due to variability arising from fundamental application characteristics. Specifically, applications from the domain of graph analytics inherently exhibit high reuse when processing natural graphs. However, the reuse pattern is highly irregular and dependent on graph topology: a small fraction of vertices, hot vertices, exhibit high reuse, whereas a large fraction of vertices exhibit low or no reuse. Moreover, hot vertices are sparsely distributed in the memory space. Data-dependent irregular access patterns, combined with the sparse distribution of hot vertices, make it difficult for existing domain-agnostic predictive techniques to reliably identify, and in turn retain, hot vertices in the cache, causing severe underutilization of the LLC capacity. We observe that software is aware of the application's reuse characteristics, which, if conveyed to the hardware efficiently, can help it reliably identify the most useful working set even amidst irregular access patterns. To that end, we propose a holistic software-hardware co-design to effectively manage the LLC for the domain of graph analytics. The software component implements a novel lightweight technique, called Degree-Based Grouping (DBG), that applies coarse-grain graph reordering to segregate hot vertices in a contiguous memory region, improving spatial locality. The hardware component implements a novel domain-specialized cache management technique, called Graph Specialized Cache Management (GRASP). GRASP augments existing cache policies to maximize the reuse of hot vertices by protecting them against cache thrashing, while maintaining sufficient flexibility to capture the reuse of other vertices as needed. To reliably identify hot vertices amidst irregular access patterns, GRASP leverages the contiguity of hot vertices enabled by DBG. Our domain-specialized cache management not only outperforms state-of-the-art domain-agnostic predictive techniques, but also eliminates the need for any storage-intensive prediction mechanisms.
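    The abstract only names the Live Distance metric, so the following Python sketch gives a rough, behavioral illustration of the idea: one cache set whose LRU stack records the deepest stack position at which each block is reused (its live distance) and prefers predicted-dead victims over plain LRU. The per-block predictor dictionary and its indexing are simplifying assumptions; the actual Leeway design is a hardware mechanism with its own prediction-table organization and adaptive update policies.

```python
# Behavioral sketch of the Live Distance idea, not Leeway's exact design.
from collections import OrderedDict

class LeewaySet:
    """One cache set whose LRU stack learns per-block Live Distance."""

    def __init__(self, num_ways, predictor):
        self.num_ways = num_ways
        self.stack = OrderedDict()   # block -> deepest stack position seen
        self.predictor = predictor   # block -> learned live distance (assumption)

    def access(self, block):
        if block in self.stack:
            # Hit: the block's current stack depth is its reuse interval.
            depth = list(self.stack).index(block)        # 0 = MRU
            self.stack[block] = max(self.stack[block], depth)
            self.stack.move_to_end(block, last=False)    # promote to MRU
            return True
        self._insert(block)
        return False

    def _insert(self, block):
        if len(self.stack) >= self.num_ways:
            victim = self._pick_victim()
            # Learning: remember the victim's observed live distance.
            self.predictor[victim] = self.stack.pop(victim)
        self.stack[block] = 0
        self.stack.move_to_end(block, last=False)

    def _pick_victim(self):
        # Prefer a predicted-dead block: one whose depth already exceeds
        # its learned live distance; otherwise fall back to plain LRU.
        for depth, blk in enumerate(self.stack):
            if depth > self.predictor.get(blk, self.num_ways):
                return blk
        return next(reversed(self.stack))                # LRU block

# Usage: replay an address stream through a 4-way set.
cache = LeewaySet(num_ways=4, predictor={})
for addr in [1, 2, 3, 1, 4, 5, 6, 2]:
    cache.access(addr)
```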

    Domain-Specialized Cache Management for Graph Analytics

    Graph analytics power a range of applications in areas as diverse as finance, networking and business logistics. A common property of graphs used in the domain of graph analytics is a power-law distribution of vertex connectivity, wherein a small number of vertices are responsible for a high fraction of all connections in the graph. These richly-connected (hot) vertices inherently exhibit high reuse. However, this work finds that state-of-the-art hardware cache management schemes struggle to capitalize on this reuse due to the highly irregular access patterns of graph analytics. In response, we propose GRASP, domain-specialized cache management at the last-level cache for graph analytics. GRASP augments existing cache policies to maximize the reuse of hot vertices by protecting them against cache thrashing, while maintaining sufficient flexibility to capture the reuse of other vertices as needed. GRASP keeps hardware cost negligible by leveraging lightweight software support to pinpoint hot vertices, thus eliding the need for the storage-intensive prediction mechanisms employed by state-of-the-art cache management schemes. On a set of diverse graph-analytic applications with large high-skew graph datasets, GRASP outperforms prior domain-agnostic schemes on all datapoints, yielding an average speed-up of 4.2% (max 9.4%) over the best-performing prior scheme. GRASP remains robust on low-/no-skew datasets, whereas prior schemes consistently cause a slowdown.
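    The abstract describes GRASP's policy only at a high level. The sketch below shows one plausible reading, assuming an RRIP-style baseline: insertion priority is biased by whether an address falls inside the software-identified hot-vertex region (made contiguous by DBG). The hot-region bounds and the 3-bit RRPV width are illustrative assumptions, not the paper's exact design.

```python
# Illustrative GRASP-style insertion/promotion bias on an RRIP baseline.
from dataclasses import dataclass

# Hypothetical hot-region bounds; in GRASP these come from software.
HOT_REGION = (0x1000_0000, 0x1400_0000)
MAX_RRPV = 7    # 3-bit re-reference prediction value (assumption)

@dataclass
class Block:
    addr: int
    rrpv: int = MAX_RRPV

def is_hot(addr: int) -> bool:
    lo, hi = HOT_REGION
    return lo <= addr < hi

def on_insert(block: Block) -> None:
    # Hot-region blocks are inserted with maximum protection; all other
    # blocks are inserted near eviction, as in thrash-resistant RRIP.
    block.rrpv = 0 if is_hot(block.addr) else MAX_RRPV - 1

def on_hit(block: Block) -> None:
    # Hits always promote, so non-hot blocks with genuine reuse can
    # still earn protection.
    block.rrpv = 0

def pick_victim(ways: list[Block]) -> Block:
    # Standard RRIP victim search: evict a block at MAX_RRPV,
    # aging all blocks until one reaches it.
    while True:
        for b in ways:
            if b.rrpv == MAX_RRPV:
                return b
        for b in ways:
            b.rrpv += 1
```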

    A Closer Look at Lightweight Graph Reordering

    Graph analytics power a range of applications in areas as diverse as finance, networking and business logistics. A common property of graphs used in the domain of graph analytics is a power-law distribution of vertex connectivity, wherein a small number of vertices are responsible for a high fraction of all connections in the graph. These richly-connected (hot) vertices inherently exhibit high reuse. However, their sparse distribution in memory leads to severe underutilization of on-chip cache capacity. Prior works have proposed lightweight skew-aware vertex reordering that places hot vertices adjacent to each other in memory, reducing the cache footprint of hot vertices. In doing so, however, they may inadvertently destroy the inherent community structure within the graph, which can negate the performance gains achieved from the reduced footprint of hot vertices. In this work, we study existing reordering techniques and demonstrate the inherent tension between reducing the cache footprint of hot vertices and preserving the original graph structure. We quantify the potential performance loss due to disruption of graph structure for different graph datasets. We further show that reordering techniques employing fine-grain reordering significantly increase misses in the higher-level caches, even when they reduce misses in the last-level cache. To overcome these limitations, we propose Degree-Based Grouping (DBG), a novel lightweight reordering technique that employs coarse-grain reordering to largely preserve graph structure while reducing the cache footprint of hot vertices. Our evaluation on 40 combinations of graph applications and datasets shows that, compared to a baseline with no reordering, DBG yields an average application speed-up of 16.8%, versus 11.6% for the best-performing existing lightweight technique.
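    As a toy illustration of coarse-grain, structure-preserving reordering in the spirit of DBG, the Python sketch below bins vertices into a few degree classes, lays out hotter groups first, and keeps the original vertex order within each group (which is what preserves community structure relative to fine-grain sorting). The specific group boundaries, powers of two of the average degree, are an assumption, not DBG's published thresholds.

```python
# Coarse-grain degree-based grouping sketch; thresholds are assumptions.
def dbg_reorder(degrees, num_groups=8):
    """Return new_id[old_id] for a coarse-grain degree-based grouping."""
    n = len(degrees)
    avg = max(1, sum(degrees) // n)

    # Group 0 holds the hottest vertices; the threshold halves per group.
    thresholds = [avg * (2 ** (num_groups - 1 - g)) for g in range(num_groups)]

    def group_of(v):
        for g, t in enumerate(thresholds):
            if degrees[v] >= t:
                return g
        return num_groups - 1          # coldest group

    groups = [[] for _ in range(num_groups)]
    for v in range(n):                 # scanning in order preserves
        groups[group_of(v)].append(v)  # relative order within each group

    new_id = [0] * n
    nxt = 0
    for g in groups:                   # hot groups are laid out first
        for v in g:
            new_id[v] = nxt
            nxt += 1
    return new_id

# Example: high-degree vertices 2 and 4 move to the front of the layout,
# while the low-degree vertices keep their relative order.
print(dbg_reorder([1, 2, 100, 3, 250, 1]))
```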

    Reconsidering OS Memory Optimizations in the Presence of Disaggregated Memory

    Tiered memory systems introduce an additional memory level with higher-than-local-DRAM access latency and require sophisticated memory management mechanisms to achieve cost-efficiency and high performance. Recent works focus on byte-addressable tiered memory architectures, which offer better performance than pure swap-based systems. We observe that adding disaggregation to a byte-addressable tiered memory architecture requires important design changes that deviate from the common techniques targeting lower-latency non-volatile memory systems. Our comprehensive analysis of real workloads shows that the high access latency to disaggregated memory undermines the utility of well-established memory management optimizations. Based on these insights, we develop HotBox, a disaggregated memory management subsystem for Linux that strives to maximize the local memory hit rate with low memory management overhead. HotBox introduces only minor changes to the Linux kernel while outperforming state-of-the-art systems on memory-intensive benchmarks by up to 2.25×.
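    The abstract does not detail HotBox's algorithms, so the sketch below only illustrates the generic problem it targets: periodically promoting the hottest pages into the fast local tier of a two-tier (local DRAM / disaggregated memory) system. The epoch-based access-count interface and the toy capacity are assumptions, not HotBox's actual mechanism.

```python
# Generic hot-page rebalancing sketch for a two-tier memory system.
import heapq

LOCAL_CAPACITY_PAGES = 4          # toy local-tier capacity, for illustration

def rebalance(access_counts, local_pages):
    """Promote the hottest pages to local memory, demote the rest.

    access_counts: page -> accesses observed this epoch (e.g., from
                   PTE accessed-bit scans, a common tiering signal).
    local_pages:   set of pages currently resident in the local tier.
    Returns (promote, demote) sets of pages to migrate.
    """
    hottest = set(heapq.nlargest(LOCAL_CAPACITY_PAGES,
                                 access_counts,
                                 key=access_counts.get))
    promote = hottest - local_pages   # hot but currently remote
    demote = local_pages - hottest    # local but no longer hot
    return promote, demote

# Example epoch: pages 3 and 9 are hot but remote, so they are promoted;
# cold local pages 1 and 7 are demoted to make room.
counts = {1: 2, 2: 50, 3: 40, 7: 1, 9: 33, 11: 30}
print(rebalance(counts, local_pages={1, 2, 7, 11}))
```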